The MURASAKI Project: Multilingual Natural Language Understanding

Authors

  • Chinatsu Aone
  • Hatte Blejer
  • Sharon Flank
  • Douglas McKee
  • Sandy Shinn
Abstract

This paper describes a multilingual data extraction system under development for the Department of Defense (DoD). The system, called Murasaki, processes Spanish and Japanese newspaper articles reporting AIDS disease statistics. Key to Murasaki's design is its language-independent and domain-independent architecture. The system consists of shared processing modules across the three languages it currently handles (English, Japanese, and Spanish), shared general and domain-specific knowledge bases, and separate data modules for language-specific knowledge such as grammars, lexicons, morphological data and discourse data. This data-driven architecture is crucial to the success of Murasaki as a language-independent system; extending Murasaki to additional languages can be done for the most part merely by adding new data. Some of the data can be added with user-friendly tools, others by exploiting existing on-line data or by deriving relevant data from corpora.

1. INTRODUCTION

Project Murasaki is a 30-month project for DoD to design and develop a data extraction prototype, operative in Spanish and Japanese and extensible to other languages. Using SRA's core natural language processing (NLP) software, SOLOMON, Project Murasaki extracts information from newspaper articles and TV transcripts in Japanese and from newspaper articles from a variety of Spanish-speaking countries. The topic of the articles and transcripts is the disease AIDS. The extracted information, some in a canonical form and some as it appears in the input texts, is stored in an object-oriented database schema implemented in a recently released multilingual version of the Sybase RDBMS. Project Murasaki has been under development since October 1990 and will be delivered to DoD in June 1993. The goal of the project was to extend SOLOMON's data extraction capabilities, hitherto used for English texts, to Spanish and Japanese.
It was explicitly requested that Murasaki be as language-independent and domain-independent as possible and be extensible to additional languages and domains ultimately. SOLOMON reflects six years of development. From its inception, language and domain independence have been deliberate design goals. Murasaki was our first extensive use of SOLOMON for languages other than English and thus the first testing-ground for its claimed language independence. SOLOMON had been used and continues to be used across a variety of domains over the past six years. In the MUC-4 conference, SRA demonstrated a single system extracting information about Latin American terrorism from newspaper articles in all three languages, using Spanish and Japanese data modules developed for Murasaki and terrorism vocabulary in Spanish and Japanese acquired in the two weeks prior to the demonstration (cf. [1, 2]). SOLOMON's architecture did not change significantly during the course of Murasaki. For the most part, its claim to language independence was borne out. Below, we will discuss how we have extended it to increase its language independence.

2. UNIQUE FEATURES OF MURASAKI

2.1. Modular Architecture

Murasaki is composed of shared processing modules across the three languages supported by separate data modules, as shown in Figure 1. Murasaki has six processing modules: PREPROCESSING, SYNTAX, SEMANTICS, DISCOURSE, PRAGMATICS, and EXTRACT. Each of these modules has associated data. For example:

  PREPROCESSING:  lexicons, patterns, morphological data
  SYNTAX:         grammars
  SEMANTICS:      knowledge bases
  DISCOURSE:      discourse knowledge sources
  PRAGMATICS:     inference rules
  EXTRACT:        extract data

Modularity is crucial to the reusability and extensibility of Murasaki. It facilitates, on the one hand, reuse of parts of Murasaki and on the other hand replacement of parts of the system.
We have been able to reuse portions of SOLOMON in the past, and expect to be able to pull modules out of Murasaki and use them separately as warranted in the future. For instance, PREPROCESSING could be used in isolation in multilingual information retrieval applications. Conversely, modules both processing and data modules -
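
The data-driven design of Section 2.1 can be illustrated with a minimal sketch, which is not SRA's implementation. It takes the module names and data kinds from the list in the text. Everything else is a hypothetical assumption made for illustration: the function names `missing_data` and `run_pipeline`, the placeholder payloads, and the choice of which data kinds sit in the shared store (the text only says that knowledge bases are shared). The point it shows is that the processing order stays fixed while supporting a new language amounts to supplying a new data dictionary.

```python
# Module order and associated data kinds, as listed in the text.
# (dicts keep insertion order in Python 3.7+.)
MODULE_DATA = {
    "PREPROCESSING": ["lexicons", "patterns", "morphological data"],
    "SYNTAX": ["grammars"],
    "SEMANTICS": ["knowledge bases"],
    "DISCOURSE": ["discourse knowledge sources"],
    "PRAGMATICS": ["inference rules"],
    "EXTRACT": ["extract data"],
}


def missing_data(language_data, shared_data):
    """Return (module, data kind) pairs that neither data store supplies."""
    return [
        (module, kind)
        for module, kinds in MODULE_DATA.items()
        for kind in kinds
        if kind not in language_data and kind not in shared_data
    ]


def run_pipeline(text, language_data, shared_data):
    """Walk the shared module order, resolving each module's data.

    Hypothetical stand-in: no real analysis of `text` is done; each
    stage only records which data kinds it would have consumed.
    """
    gaps = missing_data(language_data, shared_data)
    if gaps:
        raise ValueError(f"missing data: {gaps}")
    trace = []
    for module, kinds in MODULE_DATA.items():
        # Language-specific data takes precedence over shared data.
        resources = {
            kind: language_data.get(kind, shared_data.get(kind))
            for kind in kinds
        }
        trace.append((module, sorted(resources)))
    return trace


# Placeholder payloads; which kinds are shared is an assumption here.
shared = {
    "knowledge bases": "kb",
    "inference rules": "rules",
    "extract data": "templates",
}
spanish = {
    "lexicons": "es-lexicon",
    "patterns": "es-patterns",
    "morphological data": "es-morphology",
    "grammars": "es-grammar",
    "discourse knowledge sources": "es-discourse",
}

for module, kinds in run_pipeline("texto de ejemplo", spanish, shared):
    print(module, kinds)
```

Under these assumptions, adding a language means writing another dictionary like `spanish`; `missing_data` then reports which data modules still have to be built before the shared pipeline can run.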


Similar resources

Session 3: Natural Language Evaluation

The session on Natural Language Evaluation focused on methods for evaluating text understanding systems. Beginning with the first Message Understanding Conference (MUCK-1) in 1987, there has been increasing focus on how to measure and evaluate text understanding systems. The MUCK-1 conference required developers to port their system to a common domain of Navy intelligence messages; MUCK-2 (May ...

Different Approaches to Build Multilingual Conversational Systems

The paper describes developments and results of the work being carried out during the European research project CATCH-2004 (Converse in AThens Cologne and Helsinki). The objective of the project is multi-modal, multi-lingual conversational access to information systems. This paper concentrates on issues of the multilingual telephony-based speech and natural language understanding components.

KNOW2: Language understanding technologies for multilingual domain-oriented information access

The goal of the project is to explore integrated environments allowing the cost-effective deployment of vertical information access portals for specific domains. The project started in January 2010, and will last three years.

Towards Development of Multilingual Spoken Dialogue Systems

Developing multilingual dialogue systems brings up various challenges. Among them is the development of natural language understanding and generation components, with a focus on creating new language parts as rapidly as possible. Another challenge is to ensure compatibility between the different language-specific components during maintenance and ongoing development of the system. We describe our expe...

Evaluating Natural Language Generated Database Records

Introduction. Project MURASAKI. The purpose of Project MURASAKI is to develop a foreign language text understanding system that will demonstrate the extensibility of message understanding technology. In its current design, Project MURASAKI will process Spanish and Japanese text and extract information in order to generate records in both natural language databases, respect...

Publication date: 1993